Locating and Reconfiguring Records in Unstructured Multiple-Record Web Documents

Authors

D. W. Embley

L. Xu

Abstract

Record extraction from data-rich, unstructured, multiple-record Web documents works well [9], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [10]), many do not. When some values of textual records are factored out, are split unnaturally across boundaries, are joined unnaturally within boundaries, or are linked by off-page connectors, or when desired records are interspersed with records that are not of interest, it is difficult to automatically cull records and piece values together to form clean, delineated chunks of text that each represent a single record of interest. In this paper we address this problem and propose an algorithm to find and rearrange (if necessary) records in an HTML document. The essential idea is to attempt to maximize a record-recognition heuristic with respect to a given application ontology. Tests we conducted for two widely differing applications show that this technique properly locates and reconfigures records.
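
As a rough illustration of the idea of maximizing a record-recognition heuristic with respect to an ontology, the sketch below enumerates ways of merging adjacent text pieces into chunks and keeps the grouping whose chunks each look most like a single record. This is not the authors' algorithm: the toy ontology, its patterns, the scoring rule, and the names `ONTOLOGY`, `record_score`, `contiguous_groupings`, and `best_grouping` are all assumptions made for illustration, since the abstract does not specify the heuristic.

```python
import re
from itertools import product

# Hypothetical toy "ontology": one recognizer pattern per expected record field.
ONTOLOGY = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "price": re.compile(r"\$\d[\d,]*"),
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
}

def record_score(chunk):
    """Assumed heuristic: +1 for each field seen exactly once, -1 otherwise."""
    score = 0
    for pattern in ONTOLOGY.values():
        hits = len(pattern.findall(chunk))
        score += 1 if hits == 1 else -1
    return score

def contiguous_groupings(pieces):
    """Yield every way to merge adjacent pieces (non-empty list) into chunks."""
    for cuts in product([False, True], repeat=len(pieces) - 1):
        grouping, current = [], pieces[0]
        for piece, cut in zip(pieces[1:], cuts):
            if cut:
                grouping.append(current)   # close the current chunk
                current = piece
            else:
                current = current + " " + piece  # merge into the current chunk
        grouping.append(current)
        yield grouping

def best_grouping(pieces):
    """Pick the grouping whose chunks best resemble single records."""
    return max(contiguous_groupings(pieces),
               key=lambda g: sum(record_score(c) for c in g))

# A record split unnaturally across a boundary, followed by an intact record.
pieces = [
    "1998 Honda Civic, $4,500,",
    "call 555-1234.",
    "2001 Ford Focus, $7,900, call 555-9876.",
]
for chunk in best_grouping(pieces):
    print(chunk)
```

In this toy input the first two pieces are merged into one record and the third is left alone, because that grouping gives every chunk exactly one year, one price, and one phone number. The exhaustive enumeration is exponential in the number of pieces, so it only suits small examples.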

Similar resources

Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents

Record extraction from data-rich, unstructured, multiple-record Web documents works well [8], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [9]), many do not. When some values of textual records are factored out, are split unnaturally ac...

Automatic Location and Separation of Records: A Case Study in the Genealogical Domain

Locating specific chunks (records) of information within documents on the Web is an interesting and nontrivial problem. If the problem of locating and separating records can be solved well, the longstanding problem of grouping extracted values into appropriate relationships in a record structure can be more easily resolved. Our solution is a hybrid of two well-established techniques: (1) ontolo...

Recognizing Ontology-Applicable Multiple-Record Web Documents

Automatically recognizing which Web documents are “of interest” for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply t...

Adaptive Approximate Record Matching

Typographical data-entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find the multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages

Electronically available data on the Web is exploding at an ever-increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich, multiple-record documents (e.g. advertise...

Publication date: 2000
